This site walks through four approaches to classifying images of handwritten digits. First we will do some data visualization to help understand the problem. Then we will train and test Neural Networks, Gradient Boosting, and Random Forests to see how each performs with some tuning. Afterwards we will look at a majority vote among the three algorithms to see whether it gives any improvement.
Note: You will see libraries loaded several times and variables reused. This is because I ran these in 5 separate IPython Notebooks and 1 R script (for 3D data visualization).
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
This IPython notebook is just to get a sense of the data and then summarize the results we have seen from attempting different machine learning methods. Below is a plot of some of the handwritten digits, made by converting each row of pixel values into a matrix and then stacking the images vertically and horizontally.
from numpy import genfromtxt
#Load in the optical data
optical = genfromtxt('train.csv', delimiter=',')
#Trim off the header row (genfromtxt reads the text header as NaN)
optical = optical[1:]
#Row indices of the digits shown in each column of the 5x5 grid
grid = [[0, 1, 2, 15, 20],
        [3, 4, 5, 16, 21],
        [6, 7, 8, 17, 22],
        [9, 10, 11, 18, 23],
        [12, 13, 14, 19, 24]]
#Stack each column of digits vertically, dropping the label in column 0
columns = [np.vstack([printnum(optical[i][1:]) for i in idx]) for idx in grid]
#Join the columns side by side into one image
f = np.hstack(columns)
plt.imshow(f, cmap = plt.cm.Greys_r)
plt.axis('off')
plt.title('Handwritten Digits')
plt.savefig('HandwrittenDigits.png', bbox_inches = 'tight')
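The plotting code above calls a `printnum` helper whose definition is not shown here. A minimal sketch of what such a helper might look like is below, assuming each row holds the pixels of a square image flattened into one vector; the side length is inferred from the vector length rather than hard-coded, and the body is an illustration, not the original implementation.

```python
import numpy as np

def printnum(pixels):
    """Reshape a flat vector of pixel intensities into a square 2-D image.

    Assumption: the image is square, so the vector length is a perfect square.
    """
    pixels = np.asarray(pixels, dtype=float)
    side = int(round(np.sqrt(pixels.size)))
    if side * side != pixels.size:
        raise ValueError("pixel vector length is not a perfect square")
    return pixels.reshape(side, side)

# Toy example: 16 values become a 4x4 matrix.
image = printnum(np.arange(16))
print(image.shape)
```

Because every image comes back with the same shape, five of them can be passed to `np.vstack` and the resulting columns to `np.hstack`, as in the grid above.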
We can see that some digits are difficult to read, such as the very first image, and even the 5s can look like 6s. This suggests that we could have difficulty classifying the outliers within the dataset. As always there are tradeoffs, and capturing these outliers may require added model complexity.
We will attempt Neural Networks, which are known to be good for image classification because of their ability to find features within an image. While they can take a long time to train, we will at least attempt to tune a Neural Network to the point where performance is good, if not necessarily optimal.
Gradient Boosting may be helpful, although it could add far too much complexity in trying to catch the edge cases.
Random Forests may also be a good option, because if we use many (~500) random criteria within a decision tree we may be able to catch outliers as well.
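The majority vote among the three models mentioned at the start could be sketched roughly as follows. The function name `majority_vote`, the toy prediction lists, and the tie-breaking rule (fall back to the earliest-listed model among the tied labels) are all assumptions made for illustration, not the approach used in the later notebooks.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-sample label predictions from several models by majority vote.

    Ties go to the label predicted by the earliest-listed model among those tied
    (an assumed rule; other tie-breaks are equally reasonable).
    """
    combined = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # First label, in model order, that reaches the top vote count.
        winner = next(label for label in votes if counts[label] == best)
        combined.append(winner)
    return combined

# Hypothetical predictions for four images from three models.
nn_pred = [3, 5, 8, 1]
gb_pred = [3, 6, 8, 7]
rf_pred = [2, 6, 8, 9]
print(majority_vote(nn_pred, gb_pred, rf_pred))  # [3, 6, 8, 1]
```

With ten classes and only three voters, all three models can disagree, so the order in which models are passed matters; listing the strongest single model first is one sensible choice.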
#Read the data into a DataFrame
df = pd.read_csv("train.csv")
import matplotlib
matplotlib.style.use('ggplot')
#Count how often each label occurs (.ix has been removed from pandas; use .iloc)
df.iloc[:,0].value_counts().plot(kind='bar')
plt.title("Histogram Plot of Labels")
plt.xlabel("Number Written")
plt.ylabel("Count")
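One detail worth noting: `value_counts()` orders its result by frequency, so the bars appear from most to least common label rather than in digit order. A small sketch of sorting the counts by label first is shown below; the toy `labels` Series is made-up data standing in for the first column of `train.csv`.

```python
import pandas as pd

# Hypothetical labels standing in for the label column of train.csv.
labels = pd.Series([1, 7, 3, 1, 9, 7, 1, 0, 3, 7])

# Count each label, then order the counts by label instead of by frequency.
counts = labels.value_counts().sort_index()

# Share of the data belonging to each label, useful for checking class balance.
shares = counts / counts.sum()

print(counts)
print(shares)
```

Calling `counts.plot(kind='bar')` on the sorted counts would then draw the bars in label order, which makes it easier to compare the classes at a glance.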